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Abstract 

This paper is concerned with an important issue in finite mixture modelling, the selection of 
the number of mixing components. We propose a new penalized likelihood method for model 
selection of finite multivariate Gaussian mixture models. The proposed method is shown to be 
statistically consistent in determining of the number of components. A modified EM algorithm 
is developed to simultaneously select the number of components and to estimate the mixing 
weights, i.e. the mixing probabilities, and unknown parameters of Gaussian distributions. Sim- 
ulations and a real data analysis are presented to illustrate the performance of the proposed 
method. 
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1 Introduction 

Finite mixture modeling is a flexible and powerful approach to modeling data that is heterogeneous 

and stems from multiple populations, such as data from patter recognition, computer vision, image 

analysis, and machine learning. The Gaussian mixture model is an important mixture model family. 

It is well known that any continuous distribution can be approximated arbitrarily well by a finite 

mixture of normal densities (Lindsay, 1995; McLachlan and Peel, 2000). However, as demonstrated 

by Chen (1995), when the number of components is unknown, the optimal convergence rate of the 

estimate of a finite mixture model is slower than the optimal convergence rate when the number 

is known. In practice, with too many components, the mixture may overfit the data and yield 

poor interpretations, while with too few components, the mixture may not be flexible enough 

to approximate the true underlying data structure. Hence, an important issue in finite mixture 
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modeling is the selection of the number of components, which is not only of theoretical interest, 
but also significantly useful in practical applications. 

Most conventional methods for determining the order of the finite mixture model are based 
on the likelihood function and some information theoretic criteria, such as AIC and BIC. Leroux 
(1992) investigated the properties of AIC and BIC for selecting the number of components for 
finite mixture models and showed that these criteria would not underestimate the true number of 
components. Roeder and Wasserman (1997) showed the consistency of BIC when a normal mixture 
model is used to estimate a density function "nonparametrically" . Using the locally conic parame- 
terization method developed by Dacunha-Castelle and Gassiat (1997), Keribin (2000) investigated 
the consistency of the maximum penalized likelihood estimator for an appropriate penalization se- 
quence. Another class of methods is based on the distance measured between the fitted model and 
the nonparametric estimate of the population distribution, such as penalized minimum-distance 
method (Chen and Kalbfleisch, 1996), the Kullback-Leibler distance method (James, Priebe and 
Marchette, 2001) and the Hellinger distance method (Woo and Sriram, 2006). To avoid the irreg- 
ularity of the likelihood function for the finite mixture model that emerges when the number of 
components is unknown, Ray and Lindsay (2008) suggested to use a quadratic-risk based approach 
to select the number of components for the finite multivariate mixture. However, these meth- 
ods are all based on the complete model search algorithm and the computation burden is heavy. 
To improve the computational efficiency, recently, Chen and Khalili (2008) proposed a penalized 
likelihood method with the SCAD penalty (Fan and Li, 2001) for mixtures of univariate location 
distributions. They proposed to use the SCAD penalty function to penalize the differences of loca- 
tion parameters, which is able to merge some subpopulations by shrinking such differences to zero. 
However, similar to most conventional order selection methods, their penalized likelihood method 
can be only used for one dimensional location mixture models. Furthermore, if some components in 
the true/optimal model have the same location (which is the case for the experiment in Subsection 
4.2 of this study), some of them would be eliminated incorrectly by this method. 

On the other hand, Bayesian approaches have been also used to find a suitable number of 
components of the finite mixture model. For instance, variational inference, as an approximation 
scheme of Bayesian inference, can be used to determine the number of the components in a fully 
Bayesian way (see, e.g., Corduneanu and CM. Bishop (2001) or Chapter 10.2 of Bishop (2006)). 
Moreover, with suitable priors on the parameters, the maximum a posteriori (MAP) estimator 
can be used for model selection. In particular, Ormoneit and Tresp (1998) and Zivkovic and van 
der Heijden (2004) put the Dirichlet prior on the mixing weights, i.e. the mixing probabilities, of 
the components in the Gaussian mixture model, and Brand (1999) applied the "entropic prior" 
on the same parameters to favor models with small entropy. They then used the MAP estimator 
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to drive the mixing weights associated with unnecessary components toward extinction. Based 
on an improper Dirichlet prior, Figueiredo and Jain (2002) suggested to use minimum message 
length criterion to determine the number of the components, and further proposed an efficient 
algorithm for learning a finite mixture from multivariate data. We would like to point out the 
significant difference between those approaches and our proposed method in this paper. When 
a component is eliminated, our suggested objective function changes continuously, while those 
approaches encounter a sudden change in the objective function because zero is not in the support 
area of the prior distribution for the mixing weights, such as the Dirichlet prior. Therefore, it is 
difficult to study statistical properties of these Bayesian approaches and, especially, the consistency 
analysis is often missing in the literature. 

In this paper, we propose a new penalized likelihood method for finite mixture models. In 
particular, we focus on finite Gaussian mixture models. Intuitively, if some of the mixing weights or 
mixing probabilities are shrunk to zero, the corresponding components are eliminated and a suitable 
number of components is retained. By doing this, we can deal with multivariate Gaussian mixture 
models and do not need to assume common covariance matrix for different components. Popular L p 
types of penalty functions would suggest to penalize the mixing weights directly. However, we will 
show that such types of penalty functions do not penalize the mixing weights severely enough and 
cannot shrink them to zero. Instead, we propose to penalize the logarithm of mixing weights. When 
some mixing weights are shrunk to zero, the objective function of the proposed method changes 
continuously, and hence we can investigate its statistical properties, especially the consistency of 
the proposed penalized likelihood method. 

The rest of the paper is organized as follows. In Section 2, we propose a new penalized likeli- 
hood method for finite multivariate Gaussian mixture models. In Section 3, we derive asymptotic 
properties of the estimated number of components. In Section 4, simulation studies and a real data 
analysis are presented to illustrate the performance of our proposed methods. Some discussions are 
given in Section 5. Proof will be delegated in the Appendix. 

2 Gaussian Mixture Model Selection 
2.1 Penalized Likelihood Method 

Gaussian mixture model (GMM) models the density of a ti-dimensional random variable x as a 
weighted sum of some Gaussian densities 



M 




(2.1) 



m=l 
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where 0(x; /x m , S m ) is a Gaussian density with mean vector and covariance matrix S m , and 7r m 
are the positive mixing weights or mixing probabilities that satisfy the constraint 5^m=i 1Tm = ^" ^ or 
identifiability of the component number, let M be the smallest integer such that all components 
are different and the mixing weights are nonzero. That is, M is the smallest integer such that 
7r m > for 1 < m < M, and (/x Q ,5] a ) 7^ (/it;,, S5) for 1 < a ^ b < M. Given the number of 
components M, the complete set of parameters of GMM, 6 = Si, • • • , fi M , Sm, tti, • • • , ttm}, 
can be conveniently estimated by maximum likelihood method via the EM algorithm. To avoid 
overfitting and underfitting, an important issue is to determine the number of components M. 

Intuitively, if some of the mixing weights are shrunk to zero, the corresponding components are 
eliminated and a suitable number of components is retained. However, this can not be achieved 
by directly penalizing mixing weights 7r m . By considering the indicator variables yi m that show 
if the ith observation arises from the mth component as missing data, one can find the expected 
complete-data log-likelihood function (pp. 48, McLachlan and Peel, 2000): 

E(£(0)) = EilogJI/^fl) I = E^£y im [log7r m + log<Kx i ;/i m ,E m )] 

I i=l J I i=l m=l 

n M n M 

= ^2 ^2 h im \ogix m + ^2 ^mlog0(xi;/x m ,E m ), (2.2) 

i=l m=l i=l m=l 

where £(6) is the complete-data log-likelihood, and hi m is the posterior probability that the ith 
observation belongs to the mth component. Note that the expected complete-data log- likelihood 
involves log7r m , whose gradient grows very fast when 7r m is close to zero. Hence the popular L p 
types of penalties may not able to set insignificant 7r m to zero. 

Below we give a simple illustration on how the likelihood function changes when a mixing 
probability approaches to zero. In particular, a data set of 1000 points is randomly generated 
from a bivariate Gaussian distribution (i.e., a GMM with only one component). A GMM with two 
components, /(x) = 7Ti0(x; u%, <xi) + (1 — 7Ti)</>(x; U2, S2), is then used to fit the data. The learned 
two Gaussian components are depicted in FigureQa), and tt\ is 0.227. Furthermore, to see how the 
negative likelihood function changes with respective to it, let tt\ gradually approach zero. For each 
fixed 7Ti, we optimize all other parameters, {/Xj, Sj, i = 1,2}, by maximizing the likelihood function. 
Figure [ijb) depicts how the minimized negative log-likelihood function changes with respective to 
log7Ti. It shows that the log-likelihood function changes almost linearly along with log(-7ri) when 
7Ti is close to zero, albeit some small upticks. In other words, the derivative of the log-likelihood 
function with respective to n\ is approximately proportional to 1/tti when 7?i is close to zero, and 
it would dominate the derivative of Consequently L p penalties can not set insignificant tt\ to 
zero. 

By the discussion above, we know that Li-type penalized likelihood methods are not omnipotent, 
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(a) (b) 

Figure 1: An illustration on the behavior of the negative log-likelihood function when a mixing 
probability is close to zero, (a) The simulated data set and a learned two-component GMM model, 
(b) The minimized negative log-likelihood as a function o/log-zri. Note that the x-axis is in log 
scale. 

especially when the model is not a regular statistical model. In fact, the expected complete-data 



log-likelihood function (2.2) suggests that we need to consider some types of penalties on log7r m 
in order to achieve the sparsity for tt = {tti, • • • , ttm}- In particular, we simply choose to penalize 
log( £+ ^ rm ) = log(e+7r m ) — log(e), where e is a very small positive number, say 10~ 6 or o(ri~5 log -1 n) 
as the discussion of Theorem 3.2. Note that log(e+7r) — log(e) is a monotonically increasing function 
of tt, and it is shrunk to zero when the mixing weight tt goes to zero. Therefore, we propose the 
following penalized log-likelihood function, 

M 

t P {0) = t{9) - n\D f [log(e + vr m ) - log(e)] , (2.3) 

m=l 

where £(6) is the log-likelihood function, A is a tuning parameter, and Df is the number of free 
parameters for each component. For GMM with arbitrary covariance matrices, each component 
has Dj = 1 + d + d{d + l)/2 = d 2 /2 + 3d/2 + 1 number of free parameters. Although here Df is a 
constant and can be removed from the equation above, it would simplify the search range of A in 
the numerical study. 

Note our penalty function is similar to that derived with the Dirichlet prior from Bayesian 
point of view, both using logarithm function of the mixing weights of the finite mixture model as 
the penalty function or prior distribution function. However, for the Dirichlet prior, the objective 
penalized likelihood function penalizes log7Tj, and unlike our proposed penalty function log(e + 
TT m ) — log(e), zero is not in the support area of the penalty function log7Tj. In the mathematical 
sense, these Bayesian approaches can not shrink the mixing weights to zero exactly since zero is 
not well defined for the objective function. In other words, the objective function is not continuous 
when some of mixing weights shrunk continuously to zero. As the discussion by Fan and Li (2001), 
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such discontinuity poses challenges to investigate the statistical properties of related penalized 
or Bayesian methods. It is be one of main reasons why there is little literature on studies the 
consistency of the proposed methods based on the Dirichlet prior. Moveover, when e = 0, our 
penalty function can not be seen as in a border family of Dirichlet distributions. In fact, when 
e = 0, there is no definition for the second term of our proposed penalty function. Though when 
e > 0, our proposed penalty function is similar to the one used by Figueirdo and Jain (2002). 
However, zero is still not in the support area of their proposed improper prior function, and their 
method has similar problems as those Bayesian approaches based on the Dirichlet prior do. 

As discussed in Fan and Li (2001), a good penalty function should yield an estimator with 
three properties: unbiasedness, sparsity, and continuity. It is obvious that log( e+ ^ m ) would over 
penalize large 7r m and yield a biased estimator. Hence, we also consider the following penalized 
log-likelihood function, 

M 

£ P (9) = £{9) - n\D f £ [log(e + Px (ir m )) - log(e)] . (2.4) 

m=l 



Compared to (2.3), the only difference is that ir m is replaced by p\(7r m ) in the penalty function, 
where p\(ir) is the SCAD penalty function proposed by Fan and Li (2001) and is conveniently 
characterized through its derivative: 

= Ifr < A) + ( ^rff A + Ifr > A), 

for some a > 2 and ir > 0. It is easy to see that, for a relatively large ir m and 7r m > aX, p\(ir m ) is 
a constant, and henceforth the estimator of this iT m is expected be unbiased. 

2.2 Modified EM Algorithm 



Here we propose a modified EM algorithm to maximize (2.3) and (2.4) iteratively in two steps. 

First we introduce a modified EM algorithm to maximize (2.3). By (2.2) and (2.3), the expected 
penalized log-likelihood function is 

n M n M M 

^ y h im logTT m + y~] y] /i im log0(xi;/i m ,E m ) -nXDf [log(e + 7r m ) - log(e)] . (2.5) 

7=1 m=l 7=1 771=1 777=1 

In the E step, we calculate the posterior probability given the current estimate 9 = Si, ... , 
A*M) , 7Ti, • • • , v?m) 

, _ 7rm<M x ?.; Mm) E m ) 

z^7)7=i 7r m0(xi; n m i S m ) 

In the M step, we update 9 = Si, • • • , fi M , Sm, tti, ■ • • , ^m} by maximizing the expected pe- 



nalized log-likelihood function (2.5). Note that we can update {ni, ■ ■ ■ , ttm} and {/2 l5 Si, • • • , fj, M , Sm} 
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separately as they are not intervened in (2.5). To obtain an estimate for ir = (tti,--- , 7Tm), we 
introduce a Lagrange multiplier (3 to take into account for the constraint X^m=i n "m = 1j an d aim 
to solve the following set of equations, 

n M M M 



d 
dir m 



^ hm log 7T m - nXDf ^2 lo §( e + %) ~ PC)Z Km - 1) 



.i=l m=l m=l m=l 

Given e is very close to zero, by straightforward calculations, we obtain 

1 



0. 



■K m = max < 0, 



1 - MXD 



f 



n ^-^ 

i=l 



XD 



(2.6) 



The update equations on {/i l5 Si, • • ■ , /x M , Sy} are the same as those of the standard EM algo- 
rithm for GMM (McLachlan and Peel, 2000). Specifically, we update \i m and S m as follows, 

n n n n 

f^m — ^ ] him'X^J ^ himi S m — ^ hi m (x.i /J. m )(xj f^m) I ^ ~] him- 

i=l i=l i=l i=l 

In summary the proposed modified EM algorithm works as follows: it starts with a pre-specified 



large number of components, and whenever a mixing probability is shrunk to zero by (2.6), the 
corresponding component is deleted, thus fewer components are retained for the remaining EM 
iterations. Here we abuse the notation M for the number of components at beginning of each EM 
iteration, and through the updating process, M becomes smaller and smaller. For a given EM 
iteration step, it is possible that none, one, or more than one components are deleted. 



The modified EM algorithm for maximizing (2.4) is similar to the one for (2.3), and the only 
difference is in the M step for maximizing it. Given the current estimate (ir®, . . . , n^) for fj to 
solve 







d-TTr, 



' n M 

yi 

,i=l m=l 



M 



M 



himlogTT m - nXDf ^ k>g(e + p\(n m )) - fi(y~] 7r m - 1) 



m=l 



m=l 



0, 



we substitute log(e + p\{K m )) by its linear approximation log(e + Px(^ m )) + i~r^ \ {^m — n m ) in 



the equation above. Then by straightforward calculations, 7r m can be updated as follows 



where 



7Tn 



M 



1 n 



(2.7) 



i=i 



n — n 



in — 1 

If an updated 7r m is smaller than a pre-specified small threshold value, we then set it to zero and 



remove the corresponding component from the mixture model. In numerical study, (2.7) is seldom 
exactly equal to zero. Using such threshold method to set mixing weight to zero is only to avoid 
the numerical unstabilities. Because of the consistent properties of such penalty function derived 
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in Section 3, we can set this pre-specified small threshold value as small as possible, though small 
threshold value would increase iterative steps of EM algorithm and computation time. In our 
numerical studies, we set this threshold as 10 -4 , but the smallest mixing weight in the following 
two numerical examples are 1/600 and 1/1000, both larger than this threshold. 



2.3 Selection of Tuning Parameters 



To obtain the final estimate of the mixture model by maximizing (2.3) or (2.4), one needs to select 



the tuning parameters A and a (the latter is involved (2.4)). Our simulation studies show that the 



numerical results are not sensitive to the selection of a and therefore by the suggestion of Fan and 
Li (2001) we set a = 3.7. For standard LASSO and SCAD penalized regressions, there are many 
methods to select A, such as generalized cross-validation (GCV) and BIC (See Fan and Li, 2001 
and Wang et al, 2007). Here we define a BIC value 



BIC(A) = ^log 



i=l 



M 

£ 

m=l 



7Tr> 



( x ii Mm; ^mj 



1 



-MD f logn 



and select A by 



argmaxBIC(A), 

A 



where M is the estimate of the number of components and fl m and S m are the estimates of fi r 



and S m for maximizing (2.3) or (2.4) for a given A 



3 Asymptotic Properties 

It is possible to extend our proposed model selection method to more generalized mixture models, 
However, to illustrate the basic idea of the proposed method without many mathematical difficulties, 
in this section, we only show the model selection consistency of the proposed method for Gaussian 
mixture models. 

First, we assume that, for the true Gaussian mixture model, there are q mixture components, 
q < M with TTi = 0, for i = 1, . . . , M - q, vr; = 7if , for i = M - q + 1, . . . , M, I = 1, . . . , q. Then 
by the idea of locally conic models (Dacunha-Castelle and Gassiat, 1997 and 1999), the density 
function of the Gaussian mixture model can be rewritten as 

M-q q 

/(x, 0) = /(x, 9,(3)=J2 W ■ <K/*i> =i) + X>° + PiO) ■ Hrf + K, S? + 66&. 

where 

(3 = (Al, . . . , \M-qi Ml, • • • , VM-q, ^1, • ■ , ^M-q,^, ...,6^,5%,... ,S^,P1, ■ ■ ■ ,Pq), 
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Hi, i = l,...,q, and i = 1, . . . , q, are the true values of multivariate normal components, and 
(ttx, . . . , ttm ) m the original Gaussian mixture model can be defined as 7Tj = Aj#, i = 1, . . . , M — q 
and 7Tj = 7r^ + piO, i = M — q + 1, . . . , M, I = 1, . . . , q. Similar to Dacunha-Castelle and Gassiat 
(1997, 1999), by the restrictions imposed on the j3: 

A, > 0, ft e R d , and S 1 GR M 1 i = l,...,M-g, 
<$J, E R d , 4 G R dXd , and p € R, I = 1, . . . , q, 

M—q q M—q 11 <? 

£> + X> = and £A? + £p 2 + £||^|| 2 + ]>>y 2 = l, 

i=l Z=l i=l z=i 1=1 1=1 

and by the permutation, such a parametrization is locally conic and identifiable. 



After the parametrization, the penalized likelihood functions (2.3) and (2.4) can be written as 

M 



£p(6) = £(G)-n\D f ^ [log(e + vr m )) - log(e) 



m=l 



^ log /(xf, 0, (3) -nXDfJ^ pog(e + vr m ) - log(e)] , (3.1) 



and 



i=l m=l 



M 

M#) = i(e)-n\D f J2 [log(e + p A (^ m ))-log(e)] 



m=l 



£ P (<?,/3) 



= ^log/(x i ,e,j9)-nAZ> / ^[Iog(e + p A (7r m ))-log(e)], (3.2) 

i=l m=l 

respectively. 

We need the following conditions to derive the asymptotic properties of our proposed method. 
PI: Hftll ^ Ci> 11^*11 — ^2, i = 1, . . . , M, where C\ and C2 are large enough constants. 

P2: min{Afc(Sj), k = 1, . . . , d, i = 1, . . . , M} > C3, where Afc(5]j) are the eigenvalues of I], and C3 
is a positive constant. 

Compared to the conditions in Dacunha-Castelle and Gassiat (1997, 1999), the conditions PI 
and P2 are slightly stronger. Without lose of generality, we assume that the parameters in the 
mixture model are in a bounded compact space not only for mathematical conveniences, but also 
for avoiding the identifiability and ill-posedness problems of the finite mixture model as discussed 
in Bishop (2006). Those conditions are also practically reasonable for our revised EM algorithm as 
the discussion in Figueirdo and Jain (2002). 
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First, even if it is known there is K mixture components in the model, the maximum likelihood 
or penalized maximum likelihood solution still has a total of K\ equivalent solutions corresponding 
to the Kl ways of assigning K sets of parameters to K components. Although this is an important 
issue when we wish to interpret the estimate parameter values for a selected model. However, 
the main focus of our paper is to determine the order and find a good density estimate with the 
finite mixture model. The identifiability problem is irrelevant as all equivalent solutions have same 
estimate of the order and the density function. 

Second, Condition (P2) is imposed to guarantee the non-singularity of the covariance matrices, 
avoiding the ill-posedness in the estimation of the finite multivariate Gaussian mixture model 
with unknown covariance matrices using our proposed penalized maximum likelihood method. 
Similar as the discussion in Figueirdo and Jain (2002), given a sufficient large number of the 
initial mixture components, say M, our proposed modified EM algorithm selects components in a 
backwards manner by merging smaller components into a large component, and thus reducing the 
number of components. On the other hand, pre-specifying an extreme large value for M should be 
avoided as well. In our numerical studies, the initial number of mixture components M is set to be 
smaller than n/p so that each estimated mixture component has a positive covariance matrix, as 
required by Condition (P2). 

Theorem 3.1 Under the conditions (PI) and (P2) and if y/nX — > oo, X — > and e = o(\/y/n), 
there exists a local maximizer (6,(3) of ip, which was given in (3.2), such that 9 = O p (\/ y/n), and 
for such local maximizer, the number of the mixture components q n — >■ q with probability tending to 
one. 

Theorem 3.2 Under the conditions (PI) and (P2) and if \fnX — > C and e = °( ^i" ogn ) where 
C is a constant, there exists a local maximizer (0,(3) of ip, which was given in (3.1), such that 
6 = O v (\j \fn), and for such local maximizer, the number of the mixture components q n — > q with 
probability tending to one. 

Remarks: Under Conditions (PI) and (P2) and with an appropriate tuning parameter A, the 
consistency of our proposed two penalized methods have been shown by the two theorems above. 
Although maximizing (3.2) using our proposed EM algorithm is a bit complicated than that for 
(3.1), it has theoretical advantages. Unlike (3.2), a mixture component with a relative large mixing 
weight -Ki is still penalized in (3.1) by the penalty function log(7Tj + e) — log e, and this would produce 
a bias model estimation and affect the consistency of the model selection as the discussion by Fan 
and Li (2001). Moreover, in practice it is easier to select an appropriate tuning parameter for (3.2) 
than for (3.1) to guarantee the consistency of the final model selection and estimation. In particular, 
the following theorem shows that the proposed BIC criterion always selects the reasonable tuning 
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parameter with probability tending one by which maximizing (3.2) selects the consistent component 
number of the finite Gaussian mixture model. 

Let Component A denote the number of component of Gaussian mixture models selected by (3.2) 
with the tuning parameter A, and Xbic is the lambda selected by the proposed BIC criterion in 
Section 2.3. Then we have the following theorem. 

Theorem 3.3 Under the conditions (PI) and (P2), Pr(Component^ B/c = q) — > 1. 
The proofs of the theorems are given in the appendix. 



4 Numerical Studies 
4.1 Example I 

In the first example, we generate 600 samples from a three-component bivariate normal mixture with 
mixing weights tt\ = 1T2 = ^3 = 1/3, mean vectors /x x = [— 1, 1] T , /x 2 = [1, 1] T , /i 3 = [0,— v / 2] T , 
and covariance matrices Si = [0.65, 0.7794; 0.7794, 1.55], S 2 = [0.65,-0.7794; - 0.7794, 1.55], 
S3 = diag{2, 0.2}. In fact, these three components are obtained by rotating and shifting a common 
Gaussian density AA(0, diag(2, 0.2)), and together they exhibit a triangle shape. 



We run both of our proposed penalized likelihood methods (2.3) and (2.4) for 300 times. The 
maximum initial number of components is set to be 10 or 50 , the initial value for the modified 
EM algorithms is estimated by K-means clustering, and the tuning parameter A is selected by 
our proposed BIC method. Figure [2] shows the evolution of the modified EM algorithm, with 
the maximum number of components as 10. We compare our proposed methods with traditional 
AIC and BIC methods. Figure [3](a-c) shows the histograms of the estimated component numbers. 
One can see that our proposed methods perform much better in identifying the correct number of 
components than AIC and BIC methods do. In fact, both proposed methods estimate the number of 
components 100% correctly regardless of the maximum initial number of components. Figure |3^d) 



depicts the evolution of the penalized log-likelihood function (2.3) for the simulated data set in 
Figure (2jja) in one run, and shows how our proposed modified EM algorithm converges numerically. 

In addition, when the number of components is correctly identified, we summarize the estimation 
of the unknown parameters of Gaussian distributions and mixing weights in Table 1 and Table 2 with 
different maximum initial number of components. For the covariance matrix, we use eigenvalues 
because the three components have the same shape as AA(0, diag(2, 0.2)). Table 1 and Table 2 
show that the modified EM algorithms give accurate estimate for parameters and mixing weights. 
The final estimate of these parameters is robust to the initialization of the maximum number of 
components. 
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Figure 2: One typical run. (a) a simulated data set. (b) initialization for m = 10 components, 
(c-e) three intermediate estimates for M = 6,5,4, respectively, (f) the final estimate for M = 3. 



Proposed BIC AIC Penalized log— likelihood function 




Figure 3: Histogram of estimated numbers of components, (a) the proposed method (2.3), (b) BIC, 
(c) AIC. (d) The penalized log likelihood function for one typical run. 
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Table 1: Parameter Estimation (The initial number of components M = 10) 



Component 


Mixing Probability 


Mean 


Covariance (eigenvalue) 


1 


( 
( 


?ru 
2.3 
2.4 


) 
) 


0.3333 
0.3342(.0201) 
0.3356(.0187) 


-1 

-0.9911(.0861) 
-1.0022 (.0875) 


1 

1.0169(.1375) 
1.0007(.1428) 


2 

2.0034(.3022) 
2.0205(.2769) 


0.2 

0.1981(.0264) 
0.1973 (.0265) 


2 




?ru 
2.3 
2.4 




0.3333 
0.3317(.0196) 
0.3321 (.0193) 


1 

1.0151(.0845) 
1.0108(.0790) 


1 

0.9849(.1318) 
0.9904(.1253) 


2 

1.9794(.2837) 
1.9825(.2980) 


0.2 

0.1977(.0303) 
0.1957 (.0292) 


3 


( 
( 


?ru 
2.3 
2.4 


) 
) 


0.3333 
0.3341(.0171) 
0.3322(.0159) 




0.0019(.1324) 
0.0014(.1449) 


-1.4142 
-1.4112(.0405) 
-1.4103(.0404) 


2 

1.9722(.2425) 
1.9505(.2424) 


0.2 

0.1973(.0258) 
0.1978(.0267) 



Table 2: Parameter Estimation (The initial number of components M = 50) 



Component 


Mixing Probability 


Mean 


Covariance (eigenvalue) 




True 


0.3333 


-1 


1 


2 


0.2 


1 




2.3 
2.4 




0.3342(.0201) 
0.3320(.0190) 


-1.0080(.0881) 
-1.0017 (.0859) 


0.9854(.1372) 
0.9985(.1389) 


1.9603(.2857) 
1.9604(.2830) 


0.1974(.0296) 
0.1960 (.0286) 




True 


0.3333 


1 


1 


2 


0.2 


2 


( 
( 


2.3 
2.4 


) 
) 


.3347(.0170) 
0.3345 (.0182) 


0.9879(.0885) 
0.9987(.0896) 


1.0166(.1385) 
1.0044(.1402) 


1.9531(.2701) 
1.9661(.2460) 


0.1981(.0283) 
0.1971 (.0248) 




True 


0.3333 





-1.4142 


2 


0.2 


3 




2.3 
2.4 




0.3329(.0198) 
0.3334(.0164) 


0.0210(.1329) 
0.0117(.1302) 


-1.4105(.0344) 
-1.4116(.0372) 


1.9717(.2505) 
1.9736(.2769) 


0.1975(.0265) 
0.1998(.0281) 



4.2 Example II 



In the second example, we consider a situation where the mixture components overlap and may 
have same means but different covariance matrices. This is a rather challenging example, and the 
proposed method by Chen and Khalili (2008) can not be applied as some components have the same 
mean. Specifically, we generate 1000 samples with mixing weights 7Ti = txi = tt^ = 0.3, ir^ = 0.1, 
mean vectors = fj, 2 ~- 



-2, -2} T , n 3 = [2, 0] T , /x 4 = [1, -4] T , and 



Si 



0.1 








0.2 


0.5 








4 



2 2 
2 7 

0.125 
0.125 



Similar to the first Example, we run our proposed methods for 300 times. The maximum 
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number of components is set to be 10 or 50, the initial value for the modified EM algorithms is 
estimated by K-means clustering, and the tuning parameter A is selected by our proposed BIC 
method. Figure [4] shows the evolution of the modified EM algorithm for (2.3) with the maximum 
initial number of components as 10 for one simulated data set. Figure [5] shows that our proposed 
method can identify the number of components 100% correctly, and performs much better than 
AIC and BIC methods. Table 3 and Table 4 show that the modified EM algorithms give accurate 
estimates for parameters and mixing weights. Similar as the first example, the final estimate of 
these parameters is robust to the initialization of the maximum number of components. 





Figure 4: One typical run. (a) a simulated data set. (b) initialization for M = 10 components, 
(c-e) three intermediate estimates for M = 7,6,5, respectively, (f) the final estimate for M = 4. 



4.3 Real Data Analysis 

We apply our proposed methods to an image segmentation data set at UCI Machine Learning 



Repository (http://archive.ics. uci.edu/ml/datasets/Image+Segmentation). This data set was cre- 



ated from a database of seven outdoor images (brickface, sky, foliage, cement, window, path and 
grass). Each image was hand-segmented into instances of 3 x 3 regions, and 230 instances were 
randomly drawn. For each instance, there are 19 attributes. We here only focus on four images, 
brickface, sky, foliage and grass, and two attributes, extra red and extra green. Our objective is 
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Figure 5: Histogram of estimated numbers of components, (a) the proposed method (2.3), (b) BIC, 
(c) AIC. (d) The penalized log likelihood function for one typical run. 



Table 3: Parameter Estimation (The initial number of components M = 10) 



Component 


Mixing Probability 


Mean 


Covariance (eigenvalue) 


1 


( 
( 


?ru 
2.3 
2.4 


) 
) 


0.3 

0.3022(.0093) 
0.3009(.0095) 


-2 

-2.0010(.0216) 
-1.9995(.0206) 


-2 

-1.9989(.0291) 
-1.9975(.0319) 


0.1 

0.0979(.0114) 
0.0990(.0119) 


0.2 

0.2010 (.0242) 
0.2003(.0226) 


2 


( 
( 


?ru 
2.3 
2.4 


) 
) 


0.3 

0.2995(.0112) 
0.3017(.0118) 


-2 

-1.9989(.1133) 
-1.9995(.1202) 


-2 

-1.9963(.1837) 
-2.0049(1811) 


1.2984 
1.2864(.1407) 
1.2926(.1343) 


7.7016 
7.7219(.7301) 
7.5856(.7301) 


3 




?ru 
2.3 
2.4 




0.3 

0.3019(.0083) 
0.3012(.0087) 


2 

1.9943(.0483) 
1.9995(.0511) 




0.0001(.1294) 
-0.0001(.1244) 


0.5 

0.4986(.0529) 
0.4963(.0544) 


4 

3.9951(.3496) 
3.9998(.3911) 


4 


( 
( 


?ru 
2.3 
2.4 


) 
) 


0.1 

0.0964(.0038) 
0.0962(.0047) 


1 

1.0005(.0373) 
0.9993(.0394) 


-4 

-3.9966(.0394) 
-4.0013(.0417) 


0.125 
0.1143(.0245) 
0.1167(.0259) 


0.125 
0.1339(.0252) 
0.1317(.0278) 



to estimate the joint probability density function of the two attributes (See Figure [6^ a)) using a 
Gaussian mixture with arbitrary covariance matrices. In other words, we implement our proposed 
method to identify the number of components, and to simultaneously estimate the unknown pa- 
rameters of bivariate normal distributions and mixing weights. Although we consider only four 
images, Figure [6^a) suggests that a five-component Gaussian mixture is more appropriate and the 
brickface image is better represented by two components. 
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Table 4: Parameter Estimation (The initial number of components M = 50) 



Component 


Mixing Probability 


Mean 


Covariance (eigenvalue) 


1 


( 
( 


?ru 
2.3 
2.4 


) 
) 


0.3 

0.3016(.0107) 
0.3009(.0095) 


-2 

-1.9982(.0223) 
-1.9986(.0218) 


-2 

-1.9998(.0312) 
-1.9978(.0320) 


0.1 

0.0986(.0110) 
0.0993(.0110) 


0.2 

0.2034 (.0241) 
0.2010(.0238) 


2 




?ru 
2.3 
2.4 




0.3 

0.3002(.0128) 
0.3017(.0115) 


-2 

-2.0040(.1086) 
-1.9988(.1173) 


-2 

-2.0052(.1819) 
-2.0116(.1811) 


1.2984 
1.2823(.1386) 
1.2757(.1318) 


7.7016 
7.6696(.7538) 
7.6734(.7476) 


3 


( 
( 


?ru 
2.3 
2.4 


) 
) 


0.3 

0.3015(.0083) 
0.3012(.0084) 


2 

1.9986(.0500) 
2.0015(.0505) 




0.0054(.1365) 
0.0102(.1268) 


0.5 

0.4998(.0531) 
0.4915(.0524) 


4 

3.9951(.3651) 
3.9751(.3770) 


4 


( 
( 


?ru 
2.3 
2.4 


) 
) 


0.1 

0.0966(.0044) 
0.0962(.0050) 


1 

0.9983(.0408) 
1.0011(.0402) 


-4 

-4.0019(.0431) 
-4.0019(.0425) 


0.125 
0.1150(.0251) 
0.1154(.0256) 


0.125 
0.1327(.0258) 
0.1313(.0254) 





iJiiIIIILjII 



(b) 



(c) 



Figure 6: (a) Scatter plot of scaled real data. Brickface (blue), Sky (green), Foliage (red), Grass 
(light blue); (b-c) Histograms of marginal density, (b) Extra red, and (c) Extra green. 



As in the simulation studies, we run our proposed method for 300 times. For each run, we 
randomly draw 200 instances for each images. The maximum number of components is set to be 
ten, and the initial value for the modified EM algorithm is estimated by the K-means clustering. 
Because there is little difference between numerical results of the two proposed methods ( |2.3[ ) and 
(2.4), here we only show the numerical results obtained by maximizing (2.3). Figure [7] shows the 
evolution of the modified EM algorithm for one run. Figure [8] shows that our proposed method 
selects five components with high probability. For a five-component Gaussian mixture model, we 
summarize the estimation of parameters and mixing weights in Table 5. 
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Figure 7: One typical run. (a) Initialization with M = 10 components, (b) and (c) two intermediate 
estimates for M = 7 and M = 6, respectively, (d) the final estimate M = 5. 




Figure 8: Histogram of estimated numbers of components, (a) the proposed method (2.3), (b) BIC, 
(c) AIC. (d) The penalized log likelihood function for one typical run. 



Table 5: Parameter Estimation, M = 5 



Component 


Underlying 


Mixing Probability 


Mean 


Standard Deviation 








(Ex-red, Ex-green) 


(Ex-red, Ex-green) 


1 


Sky & Foliage 


.4153(0.0555) 


-27.7689(2.3336) 


-13.7343(1.1377) 


12.0464(1.1726) 


7.3739(0.3745) 


2 


Grass 


.2617(0.0523) 


-9.3724(0.8588) 


13.4992(4.0740) 


3.6393(1.2160) 


3.5632(1.3567) 


3 


Foliage & Brickface 


.1447(0.0425) 


-4.9511(1.3913) 


-3.3625(3.5537) 


3.1584(0.7102) 


1.6728(0.4004) 


4 


Brickface 


.0824(0.0487) 


-1.3474(0.8417) 


-12.0906(7.2158) 


2.5780(1.5227) 


1.3302(0.7854) 


5 


Brickface 


.0936(0.0717) 


3.5476(1.4703) 


-8.4996(2.1980) 


1.5110(1.2288) 


1.3293(1.6460) 
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5 Conclusions and Discussions 



In this paper, we propose a penalized likelihood approach for multivariate finite Gaussian mixture 
models which integrates model selection and parameter estimation in a unified way. The proposed 
method involves light computational load and is very attractive when there are many possible can- 
didate models. Under mild conditions, we have, both theoretically and numerically, shown that our 
proposed method can select the number of components consistently for Gaussian mixture models. 
Though we mainly focus on Gaussian mixture models, we believe our method can be extended 
to more generalized mixture models. This requires more rigorous mathematical derivations and 
further theoretical justifications, and is beyond the scope of this paper. 

In practice, our proposed modified EM algorithm gradually discards insignificant components, 
and does not generate new components or split any large components. If necessary, for very complex 
problems, one can perform the split-and-merge operations (Ueda et al. 1999) after certain EM 
iterations to improve the final results. We only show the convergence of our proposed modified 
EM algorithm through simulations, and further theoretical investigation is needed. Moreover, 
classical acceleration methods, such as Louis' method, Quasi-Newton method and Hybrid method 
(McLachlan and Peel, 2000), may be used to improve the convergence rate of our proposed modified 
EM algorithm. 

Another practical issue is the selection of the tuning parameter A for the penalized likelihood 
function. We propose a BIC selection method, and simulation results show it works well. Moreover, 
our simulation results show the final estimate is quite robust to the initial number of components. 



In this paper, we propose two penalized log-likelihood functions (2.3) and (2.4). Although 
the numerical results obtained by these two penalized functions are very similar, we believe they 
have different theoretical properties. We have shown the consistency of model selection and tuning 



parameter selection by maximizing (2.4) and the BIC criterion proposed in the paper under mild 



conditions. We have also shown the consistency of model selection by maximizing (2.3), but we 



note that the conditions are somewhat restrictive. In particular, the consistency of BIC criterion 



to select the tuning parameter for (2.3) needs further investigations. 

An ongoing work is to investigate how to extend our proposed penalized likelihood method to 
the mixture of factor analyzers (Ghahramani and Hinton, 1997), and to integrate clustering and 
dimensionality reduction in a unified way. 

Appendix: Proof 

We now outline the key ideas of the proof for Theorem 3.1. 
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First assume the true Gaussian mixture density is 



0> = £ E?), 



1=1 

then define V as the subset of functions of form 

^ So ^ ' ^ So + Ai 50 + 

1=1 i=l yu 1=1 i>j=l yU i=l yU 1=1 yU 

where is the derivative of 4>(pi, Sj) for the ith component of Hi, is the derivative of </>(/Xj, 5],) 
for the component of Sj. For functions in D, (/ij, Sj), i = 1, . . . , M — q and /it , S° satisfy the 
conditions PI and P2. The important consequence for T> is the following proposition. 

Proposition A.l. Under the conditions PI and P2, T> is a Donsker class. 

Proof: First, by the conditions PI and P2, it is easy to check that the conditions PI and P2 in 
Keribin (2000) or P0 and PI in Dacunha-Castelle and Gassiat (1999) are satisfied by V. Then 
(D t V(A*i,S?),D|- >J X/iP,Sj , ),^(Ai i ,E i )) are within envelope functions (F 1 ,F 2 ,F 3 ), and square m- 
tegrable under gov. On the other hand, by the restrictions imposed on the (3, Aj, pi, ^fS 1 ^. and 
t^iSj^. are bounded. Therefore, similar to the proof of Theorem 4.1 in Keribin (2000) or the proof 
of Proposition 3.1 in Dacunha-Castelle and Gassiat (1999), it is straightforward to show that T> 
has the Donsker property with the bracketing number N(e) = l/e K where K = M(d+d(d+l)/2). □ 

Proof of Theorem 3.1: To prove the theorem, we first show that there exists a maximizer 
(9,(3) such that 6 = O p (\/y/n). In fact, it is sufficient to show that, for a large constant C, 
t(6, (3) < £(0, (3) where 9 = C/y/n. Let 9 = C/y/n, and notice that 

n M 

l p {9,(3) - £ p (0,(3) = ^{log/( Xj ,#,/3) - logflbfc)} - n\D f £ [log(e +p A (vr m )) - log(e)] 

i=l m=l 

q 

+n\D f J][log(e + px(ir )) - log(e)], 
l=i 

and then 

n 

l p (e,(3)-l p (0,(3) < ^{log/(x i ,0,/3)-log 5o (x i )} 

i=i 

M 

-n\D f [log(e+p A (7r m )) ~^g{e + p x (-K Q m _ M+q ))} 

m=M-q+l 

= h+h- 
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For I2, because 9 = C j ^Jn and by the restriction condition on pi, I = l,...,q, we have \ir m — 
-M+q\ — CI y/n when m > M — q. Due to the property of the penalty function, we then have 

M 

\h\ = I - nXDf [log(e+p A (vr m )) - log(e +p A (7r° t _ M+g ))]| 

m=M— q+1 
M 

= \-nXD f ^2 [log(e + aA) -log(e + aA)]| 

m=M-q+l 

= 0. 

For Ji, we have 

T /(xi,6>,/3) -go(xj) 1 y-r / /(x,,6>,/3) - ff (xi) \ 2 1 A TJ ( f(x-i,0,P) - go{x-i) 

holds for 6 = C ' j 1 \fn, where \Ui\ < 1. Expand f(x.,9,0) up to the second order, 

/(x, 0, /3) = 50 (x) + 9 • /'(x, 0, 0) + y ■ f" (x, r , /3), 

for &9* <9. 

Noticing = C/y/n, Ef /go = and Ef"/go = 0, and by the conditions PI, P2 and Proposition 
A.l for the class V, we have 

Since -4= ^ ~ ^(x-'f^ converges uniformly in distribution to a Gaussian process by Proposition 
i=i 

11 ( ffx- 3) \ 2 

A.l, and ( g (x ) ) ^ s °^ or der O p (n) by the law of large numbers, we have 

i=i ^ ! 

/1 = -S= • Op(v^) - — • O p {n). 
\Jn n 



When C is large enough, the second term of I\ dominates other terms in the penalized likelihood 
ratio. Then we have 

e p (e,p)-e p (o,p)<o 

with probability tending to one. Hence there exists a maximizer (0, 0) with probability tending to 
one such that 

e = o p (^=). 



Next we show that q = q or that 7? m = 0,m = 1, . . . ,M — q, when the maximizer (6,0) 
satisfies 9 = O p (^). In fact, when 9 = O p (^), we have 7? m = O p (l/y/n), m = 1, . . . , M — q, by 
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the restriction condition on Aj. A Lagrange multiplier ft is taken into account for the constraint 
Ylm=i ^ rn = 1- Then it is then sufficient to show that 

d£* (0} 

— =M < for 5r m < e n (A.l) 
<9vr m 

with probability tending to one for the maximizer (9,(3) where e n = Cn~ 1 / 2 , m < M — q, and 

i ~ !)■ To show the equation above, we consider the partial derivatives 
for 7r m , m > M — q firstly. They should satisfy the following equation, 



9 ^ 1 =it t 7 "^'^ nXD f ^-ft = 0. (A.2) 



It is obvious that the first term in the equation above is of order O p (n) by the law of large numbers. If 
m > M—q and 9 = O p (^), it is easy to know that 7? m = TT { ^ n _ AI+q +O p (\/ ^/n) > |-min(7Tj ) , . . . , ir®), 
and hence the second term should be O p (nX) = o p (n). So we have ft = O p (n). 
Next, consider 

SriS) = ± <MPm,Sm) _„ XD _^_p, (A . 3) 



where m < M — q and 7? m < e n . As shown for (A.3), it is obvious that the first term and the third 



term ft in the equation above are of order O p (n). For the second term, because 7r m = O p (l/y/n), 
/nA —7- oo and e is sufficient small, we have 

\n\Df- — - — - 1 In = XDf — - — = OJVnX) — > oo. 



with probability tending to one. Hence the second term in the equation (A.3) above dominates 



the first term and the third term in the equation. Therefore we proved the equation (A.l), or 
equivalently TT m = 0, m = 1, . . . , M — q with probability tending to one when n — > oo. □ 



Proof of Theorem 3.2: To prove Theorem 3.2, similar as the proof of Theorem 3.1, we first show 
that there exists a maximizer (9,(3) such that 9 = O p (\/^/n) when A = C '/ ' \fn, and it is sufficient 
to show that, for a large constant C\, £(9,(3) < £(0,(3) where 9 = C\/y/n. Let 9 = Ci/^/n, and as 
the similar step of Theorem 3.1, we have 

n 

£ p (9,(3)-£ p (0,(3) < ^{log/(x J ,^,/3)-log 5 o(x l )} 

i=l 

M 

-nXD f ^ [log(e + vr m ) - log(e + 

7T rn-M+g\ 

m=M-q+l 

= h + h- 
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For I2, because of 9 = C\j ^fn and by the restriction condition on pi, I = 1, . . . , q, we have \7r m — 
-M+q\ — Ci/ V™ when m > M — q. By the property of the penalty function, we then have 

M 

|/ 2 | = \-n\D f { l °s(e + 7r m )-log(e + 7r° m _ M+q )}\ 



m=M-q+l 
M 

-nXDf 

m=M-q+l 

qG 

Ri 



(vr m *y M+q ) , (1+o(1)) 



e + vr! 



m— M+5 



= 0( v ^)-^(l + o(l)) = 0(C 1 ). 



For Ji, as in the proof of Theorem 3.1, we have 

I 1 = ^.0 P (Vn)-^-O p (n). 
yjn n 

When C\ is large enough, the second term of I\ dominates I2 and other terms in I\ . Hence we have 

e p (9,p)-e p (o,p)<o 

with probability tending to one. Hence there exists a maximizer (6>, f3) with probability tending to 
one such that 



Next we show that there exists a maximizer (9,/3) satisfies 9 = O p (^) such that q = q or 
Tr m = 0,m = l,...,M -q, 

First, we show that for any maximizer £ p (9* ,/3*) with \9*\ < C\j ' \fn if there is k < M — q such 



that Ci/y/n > 7Tk > 1/y/n logra, then there should exist another maximizer of £ p (9,(3) in the area 
of \9\ < C\j \fn. It means that the extreme maximizer of £ p (9,(3) in the compact area \9\ < Ci/y/n 



should satisfy that < ^ ri l ogn for any k < M — q + 1. Hence it is also equivalent to show for 
any such kind maximizer £ p (9*,f3*) with \9*\ < Ci/y/n, we always have £ P (9*,P*) < £ p (0,(3*) with 
probability tending to one. Similar as the analysis before, we have 



£ p (9*,(3*)-£ p (0,(3*) < ^{log /(x,,^, (3*) -loggofc)} 



M 



-nXDf Yj [ lo g( £ + Ki) ~ lQ g( e + Km-M+q] ~ nXD f lo g 
m=M-q+l 

= h + h + I?,. 

As shown before, we have I\ + I2 = O p (Cf). For ^3, because e = °{ ^\ ogn ) we have 

|/ 3 | = 0(n ■ C/y/n) • log ^ = 0(y/n). 



e 
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Then notice that I3 is always negative and dominate I\ and I2, and hence we have £ p (6, (3) < £ p (0,(3). 

In the following step, we need only consider the maximizer £ p (Q : (3) with \9\ < C\/y/n and 
7?fe < 1 / yjn logn for k < M — q + 1 . 

A Lagrange multiplier @ is taken into account for the constraint Em=i ^ m = It is then 
sufficient to show that 

^^<0 for £ m <_L_ (A.4) 
a7r m v n log n 

with probability tending to one for the maximizer (9,(3) where £*(6) = £(0) — 7r m — 1). 

To show the equation above, we consider the partial derivatives for 7r m , m > M — q firstly. They 
should satisfy the following equation, 

It is obvious that the first term in the equation above is of order O p (n) by the law of large numbers. If 
m > M—q and 6 = O p (-5=), it is easy to know that 7? m = TT^ n _ AlJrq + O p (l/ ' yfn) > ^•min(7T] ) , . . . , tr®), 
and hence the second term should be O p (nX) = o p (n). So we have (3 = O p (n). 
Next, consider 

djm = f &n(*n,Sm) nXD f — ^ /3. (A.6) 



where m < M — q and 7? m < — . As shown for (A.5), it is obvious that the first term 
and the third term /3 in the equation above are of order O p (n). For the second term, because 
7?m = 0»(^ — ), A = C/y/n and e = of -7=7 — ) , we have 

< nXDf- r 1 In = XDf = O v ( A • v/^log ra) — > 00. 

I / (e + vr m )J/ J e + 7r m pv 



with probability tending to one. Hence the second term in the equation (A.6) above dominates 
the first term and the third term in the equation. Therefore we proved the equation (A.4), or 
equivalently 7? m = 0, m = 1, . . . , M — q with probability tending to one when n — > 00. □ 

Proof of Theorem 3.3: By Proposition A.l, and as in the example of Gaussian Case shown 
in Keribin (2000), we know that the Conditions (P1)-(P3) and (Id) are satisfied by Multivariate 
Gaussian mixture model. Hence to prove this theorem, we can use the theoretical results obtained 
by Keribin (2000) and follow the proof step of Theorem 2 in Wang et al. (2007). 

First, given A* = y^p", by Theorem 3.1, we known that q = q with probability tending to 
1, and TTM-q+m, fTi = 1, . . . q is the consistent estimate of 7T- 1 , i = 1, . . . , q. Hence with probability 
tending to 1, we have 

£ P (d x *) = £(0 X *) - nX*D f ■ q ■ log 
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where 9\* is the parameter estimators of the multivariate Gaussian mixture model. On the other 
hand, when q is known, we know its maximum likelihood estimate Omle is consistent. Hence we 
have 

<? 

£p(Omle) = ((Omle) - n\*D f ^ [log(e + p\{n m>M LE)) - log(e)] 

m=l 

= £(e M LE) - nX* D r q- log >£p(0 x *), 

where Tt m ,MLE is the maximum likelihood estimate of 7r^,m = l,...,q. Then by the convex 
property of 1(0) and the definition of £p(0\*), when A* = y / ^p we have the oracle property of 
the penalized estimate of 6 which should be equal to Omle with probability tending to one. 
Next, we can identify two different cases, i.e., underfitting and overfitting. 

Case 1: Underfitted model, i.e., q\ < q. According to the definition of the BIC criterion, we 
have 

BICx = £(0 X ) - ^qxDjlogn < £(0^ MLE ) - ^q x D f logn 

where 0^ MLE is the maximum likelihood estimate of the finite Gaussian mixture model when the 
number of the components is qx- Similar as Keribin (2000), we know that 



l {£(0^,MLE)-£(O q ,MLE)} = \ {t(0^MLE)-t(0x*)} -> -K(g ,G^) 

where Qq x is the finite Gaussian mixture model space with q mixture components. Then we have 
BICx- BICx* < £(Oq x ,MLE)-£(Oq x ,,MLE)-^q\Dflogn+-q x *D f \ogn 

= £(0q x ,MLE) ~ £(O q ,MLE) ~ \<l\ D f lo g n + \l D f lo S n 

= -nK(g ,Gq x (l + o p (l)) + -(q - q x )D f logn 



< o, 



This implies that 



Pr( sup BICx > BICx*) -> 0. (A. 7) 

\:q\<q 

Case 2: Overfitted model, i.e., q\ > q. As Keribin (2000), for p > q, by Dacunha-Castelle 
(1999) we know that 

£(0q x ,MLE) - £(0q,MLE) 

converges in distribution to the following variables 

sup J sup -^l^>o ; sup \ U 2 d + f d l u > ) \ 
[dev * d 1 ev 1 ,d 2 e~D 2 1 v J 



24 



where T>\ and T>i are subsets of a unit sphere H of functions (For detail definition of T>\, T>\ and 
H, see Keribin, 2000). Hence @{@q Xt M le) — ^(fiq,MLE) = O p (l) and we have 

BIC X - BIC X * < £(d ?xMLE )-£(d^ MLE ) - ~q x D f logn + ~q x *D f logn 

= t@qx,MLE) ~ t@q,MLE) ~ ^qxD f logn + ^qD f log n 

1 



O p (l) + -(q -q x )D f log n 



< 0, 



and 



Pr( sup BIC X > BIC X * 



Combined (A. 7) with (A. 8), Theorem 3.3 has been proved. □ 



(A. 
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